
Ct In Action



We walk through several examples to demonstrate how Ct boosts productivity and performance across a variety of application domains. We take a step-by-step approach to make the guidelines for porting to Ct clear. In particular, we provide rules of thumb for translating sequential code to Ct code.

Black-Scholes

Option pricing is a compute-intensive and important application in modern financial engineering. Black-Scholes is a widely accepted analytical model for European option pricing. We use it here as an exemplar of C/C++-to-Ct migration. The code below shows the sequential C version.


1 float s[N], x[N], r[N], v[N], t[N];
2 float result[N]; 
3 for(int i = 0; i < N; i++) { 
4 float d1 = s[i] / ln(x[i]); 
5 d1 += (r[i] + v[i] * v[i] * 0.5f) * t[i]; 
6 d1 /= sqrt(t[i]); 
7 float d2 = d1 - sqrt(t[i]); 
8 result[i] = x[i] * exp(r[i] * t[i]) * 
9 ( 1.0f - CND(d2)) + (-s[i]) * (1.0f - CND(d1)); 
10 }

The code below shows its Ct counterpart. The two pieces of code are very similar (lines 4-9).


0 #include <ct.h> 
1 T s[N], x[N], r[N], v[N], t[N]; 
2 T result[N]; 
3 TVEC<T> S(s, N), X(x, N), R(r, N), V(v, N), T(t, N); 
4 TVEC<T> d1 = S / ln(X); 
5 d1 += (R + V * V * 0.5f) * T; 
6 d1 /= sqrt(T); 
7 TVEC<T> d2 = d1 - sqrt(T); 
8 TVEC<T> result = X * exp(R * T) * 
9 ( 1.0f - CND(d2)) + (-S) * (1.0f - CND(d1));

The only differences are these:

  • Ct needs to include the ct.h header file (line 0).
  • Ct adds the TVEC declarations (line 3).
  • Ct exempts programmers from having to manipulate arrays with loops and subscripts (lines 3-9).
  • The Ct version coexists well with C++'s parametric polymorphism, allowing the code to be instantiated with different types T (a sketch of such a wrapper follows this list).
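For example, because the Ct version is ordinary C++, the kernel can be wrapped in a function template and instantiated with different element types. The following is only a sketch under the assumptions of the code above (it reuses the TVEC operators and the CND function shown in this article; the wrapper name and the Tm parameter name are ours):

#include <ct.h>

template <typename T>
TVEC<T> blackScholes(TVEC<T> S, TVEC<T> X, TVEC<T> R, TVEC<T> V, TVEC<T> Tm)
{
    // Same body as lines 4-9 above; the time vector is named Tm here so it
    // does not collide with the template parameter T.
    TVEC<T> d1 = S / ln(X);
    d1 += (R + V * V * 0.5f) * Tm;
    d1 /= sqrt(Tm);
    TVEC<T> d2 = d1 - sqrt(Tm);
    return X * exp(R * Tm) * (1.0f - CND(d2)) + (-S) * (1.0f - CND(d1));
}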

The desirable tradeoff here is that the coding overhead of migrating to Ct is small, yet you get highly efficient vectorized, parallelized, and forward-scaling code. In contrast, a manually vectorized version using MMX/SSE intrinsics runs to 51 lines of code, and manual parallelization using threads requires an additional 20+ lines. When the underlying hardware or OS changes, you may need to modify such code to use new intrinsics, change the number of threads, or switch threading primitives (e.g., pthreads).
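To give a flavor of what the manual approach involves, here is a hedged fragment (not the article's 51-line version) that vectorizes just one statement, d1 += (r + v*v*0.5f)*t, with SSE intrinsics; it assumes d1 has been materialized as a float array, and loads, arithmetic, stores, and a scalar tail loop must all be spelled out by hand:

#include <xmmintrin.h>

__m128 half = _mm_set1_ps(0.5f);
for (int i = 0; i + 4 <= N; i += 4) {
    __m128 R  = _mm_loadu_ps(&r[i]);
    __m128 V  = _mm_loadu_ps(&v[i]);
    __m128 Tv = _mm_loadu_ps(&t[i]);
    // term = (R + V*V*0.5f) * Tv, four elements at a time
    __m128 term = _mm_mul_ps(_mm_add_ps(R, _mm_mul_ps(_mm_mul_ps(V, V), half)), Tv);
    _mm_storeu_ps(&d1[i], _mm_add_ps(_mm_loadu_ps(&d1[i]), term));
}
// ... plus a scalar loop for the remaining N % 4 elements (omitted)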

The code below shows how a Cumulative Normal Distribution function, CND, is implemented in C and Ct, respectively. Though the C code can be translated into Ct easily, we use a Ct Application Library function, Ct::Polynomial::eval, to accelerate the polynomial evaluation. Our data show that for a 5th-order polynomial, the optimized Ct library yields approximately 3X speedup over the naive implementation with negligible precision loss.


float CND(float d) { // The C version
   ... ...
   w = 0.31938153f * k -
       0.356563782f * k * k +
       1.781477937f * k * k * k -
       1.821255978f * k * k * k * k +
       1.330274429f * k * k * k * k * k;
   ... ...
}

template <typename T>
TVEC<T> CND(TVEC<T> d) { // The Ct version
   ... ...
   static T coefficients[] = {
        0.0f,
        0.31938153f,
       -0.356563782f,
        1.781477937f,
       -1.821255978f,
        1.330274429f};

   w = Ct::Polynomial::eval(k, 5 /*order*/, coefficients);
   ... ...
}
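The roughly 3X gain from Ct::Polynomial::eval is plausible because an optimized evaluator can use Horner's rule, one multiply-add per coefficient, instead of recomputing the powers of k as the C version above does. A minimal scalar sketch of Horner's rule (for reference only; this is not the Ct library's implementation):

// Evaluates c[0] + c[1]*k + ... + c[order]*k^order with one multiply-add per step.
float horner(const float *c, int order, float k)
{
    float acc = c[order];
    for (int i = order - 1; i >= 0; i--)
        acc = acc * k + c[i];
    return acc;
}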

Rule of Thumb I:


 for(int i = 0; i < N; i++) 
 ... data[i] ... 
 ->
... data ...

Convolution

Convolution is a widely used function in application domains ranging from signal/image processing to statistics and geophysics. Compared with the embarrassingly parallel Black-Scholes, the computation pattern of convolution is slightly trickier. In particular, programmers have several mechanisms at their disposal, and we explore the tradeoffs among these approaches.

Convolution algorithm illustration (1 dimensional)

The figure above illustrates a typical 1D convolution algorithm, where y is a data set and x is a kernel sliding across it. The code below gives the C implementation. Compared to the figure, the loop structure is more complicated and the array access pattern is more irregular (particularly y[i - j] in the inner loop).


float x[M], y[N], z[N];
for (int i = 0; i < N; i++) {
     z[i] = 0.0f;
     for (int j = 0; j < M; j++)
          z[i] += x[j] * y[i-j];
}

Our first question is how to map the two-level loop nest to TVECs. Obviously, we want to abstract the data set, y, as a TVEC. To do so, we peel off the outer loop and change all occurrences of y to Y, as shown below.


T x[M], y[N], z[N];
TVEC<T> Y(y, N), I = TVEC<T>::index(0, N); 
TVEC<T> Z = TVEC<T>::constant(0.0f, N) ; 
for (int j = 0 ; j < M; j++) 
    Z += x[j] * Y[I-j];

The second question is how to represent y[i - j] in Ct. Because i is a loop induction variable incremented with a step of 1, we map it to an identity (index) vector, I, which turns the access into Y[I - j]. Ct's C++ front-end reinterprets the [] operator as a gather operator, gathering values from Y according to the index vector I - j.
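Conceptually, a gather performs one indexed load per output element, which is why it is costly on hardware without native gather support. A minimal scalar sketch of the assumed semantics (illustrative only):

// gather(Y, idx): out[i] = y[idx[i]] for every i.
// With idx = I - j (so idx[i] == i - j), this reproduces y[i - j] from the C loop.
float y[N], out[N];
int idx[N];
for (int i = 0; i < N; i++)
    out[i] = y[idx[i]];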

This port is straightforward, but we can't claim the solution is ideal for performance because the gather operator is expensive on most architectures. Experienced Ct programmers who understand the algorithm may resort to a more lightweight operator, shiftPermute. If you look at the convolution figure above from a different perspective, that is, the kernel x is fixed while the data set y slides leftward, the result is equivalent. The optimized implementation is shown in the code below.


T x[M], y[N], z[N];
TVEC<T> Y(y, N);
TVEC<T> Z = TVEC<T>::constant(0.0f, N);
for (int j = 0; j < M; j++)
    Z += x[j] * Y.shiftPermute(1);

It is worth mentioning that the Ct implementation can be extended to 2D convolution with minimal effort, as sketched below.
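As a sketch of that extension, assuming a hypothetical 2D container (TVEC2D is our placeholder name, not the article's API) and the two-argument shiftPermute form that appears in Rule of Thumb II below:

T k2d[KH][KW];                                   // small 2D kernel
TVEC2D<T> Img(image, H, W);                      // hypothetical 2D TVEC over the image
TVEC2D<T> Out = TVEC2D<T>::constant(0.0f, H, W);
for (int a = 0; a < KH; a++)
    for (int b = 0; b < KW; b++)
        Out += k2d[a][b] * Img.shiftPermute(a, b);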

Rule of Thumb II:

 

for(int i = 0; i < M; i++)
    for(int j = 0; j < N; j++)
       ... data[i + a, j + b] ...
->
    ... data.shiftPermute(a, b) ...

Sparse Matrix Vector Product (SMVP)

Linear algebra, particularly matrix operations, is quite common in high-performance computing, physics simulation, aspects of machine learning, and many Recognition, Mining, and Synthesis (RMS) applications. Sparse matrices are extremely useful for cases where the particular algebraic formulation of a problem sparsely populates elements in the matrix with meaningful values.

An example is large-scale physics simulation. In such cases, the logical size of a dense matrix might be hundreds of megabytes, whereas a sparse representation that stores only the non-zero elements might hold just a megabyte of data. Unlike dense matrices, whose control paths and data access patterns are highly predictable, sparse matrices are much harder to handle because of the diversity of their data structures and the irregularity of their algorithms. In this section, we use a common kernel in gaming and RMS applications, Sparse Matrix Vector Product (SMVP), to demonstrate how a sparse matrix-vector multiplication is implemented with Ct.

We use the Compressed Sparse Column (CSC) format. The basic idea of CSC is to store only the non-zero elements of the matrix, in column order, and to store the row index alongside each non-zero element. Consider the sparse matrix below.


A = [[0 1 0 0 0]
     [2 0 3 4 0]
     [0 5 0 0 6]
     [0 7 0 0 8]
     [0 0 9 0 0]]

The matrix is stored as three arrays:

  • mval: the non-zero values, in column-major order.
  • row_idx: the row indices of the non-zero values.
  • col_ptr: the column pointers; col_ptr[i] gives the index into mval at which the values of the i-th column start.


mval = [2 1 5 7 3 9 4 6 8]
row_idx = [1 0 2 3 1 4 1 2 3]
col_ptr = [0 1 4 6 7 9]
vval = [1 2 3 4 5] // the vector values
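As a sanity check on the format, here is a minimal C sketch (not from the article) of how these three arrays could be built from a dense ROWS x COLS matrix A; ROWS, COLS, and MAX_NNZ are assumed constants:

float A[ROWS][COLS];
float mval[MAX_NNZ];
int   row_idx[MAX_NNZ], col_ptr[COLS + 1];

int nnz = 0;
for (int c = 0; c < COLS; c++) {
    col_ptr[c] = nnz;                    // where column c's values start in mval
    for (int r = 0; r < ROWS; r++) {
        if (A[r][c] != 0.0f) {
            mval[nnz]    = A[r][c];      // nonzero value, column-major order
            row_idx[nnz] = r;            // its row index
            nnz++;
        }
    }
}
col_ptr[COLS] = nnz;                     // one-past-the-end sentinel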

The schema for computing the SMVP is shown below.


for (c = 0; c < col_num; c++) {
    for (e = col_ptr[c]; e < col_ptr[c + 1]; e++) {
        int r = row_idx[e];
        product[r] += mval[e] /* A[r][c] */ * vval[c];
    }
}

It is worth observing some of the computing patterns to comprehend the implications for porting to Ct:

  • The two-level loop nest is more irregular than in the preceding examples. Typically, this kind of loop structure maps to a two-level nested vector, as shown below:

    [
        [...], // col_ptr[1] - col_ptr[0] elems
        [...], // col_ptr[2] - col_ptr[1] elems
           ...
        [...], // col_ptr[col_num] - col_ptr[col_num-1] elems
    ]
    

  • In the inner loop, vval[c] is used col_ptr[c+1] - col_ptr[c] times, which can be viewed as a special kind of gather operation. In Ct, we have a dedicated operator, distribute, that replicates each value of a vector a number of times specified by another vector.

  • The expression product[row_idx[e]] += ... implies that we are performing what is called a combining-send, also known as a multi-reduction or combining-scatter (a scalar sketch of this pattern follows the list).
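In scalar terms, a combining-send accumulates values into positions named by an index vector, summing values that land on the same position; it is exactly what the product[r] += ... statement above does, factored out on its own. A minimal sketch (illustrative only, not Ct API):

// Combining-send / multi-reduction: values landing on the same index are summed.
float in[NNZ], out[ROWS];
int   idx[NNZ];                       // e.g., row_idx
for (int r = 0; r < ROWS; r++) out[r] = 0.0f;
for (int e = 0; e < NNZ; e++)
    out[idx[e]] += in[e];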

With these patterns in mind, the Ct implementation is shown below:


vval = vval.distribute(col_ptr);              //=> [1 2 2 2 3 3 4 5 5]
TVEC<T> product = mval * vval;                //=> [2 2 10 14 9 27 16 30 40]
product.applyNesting(row_idx, ctNestedIndex); //=> [[2] [2 9 16] [10 30] [14 40] [27]]
product = addReduce(product);                 //=> [2 27 40 54 27]

Rule of Thumb III:

for(int i = 0; i < num_segs; i++)
    for(int j = seg[i]; j < seg[i+1]; j++)
        ... data[j] ...
->
    data: [ [seg_1], [seg_2], ..., [seg_num_segs] ]

Rule of Thumb IV:


for(int i = 0; i < M; i++)
    data[j] += ...
->
    addReduce(data);

Experimental Results

We measured the performance of Ct and sequential C implementations of representative data-parallel operators and a set of real-world applications on an Intel Xeon processor E5345 platform (two 2.33GHz quad-core processors, 4GB of memory), and plotted the speedup of Ct over C in the two figures below. To present the benefits of the JIT compiler optimizations and the TRT separately, we configured Ct to run with different numbers of cores. The C implementations were compiled with the Visual C++ 2005 compiler.

The figure below compares Ct's performance against C (compiled with O3-level optimizations by the Intel C Compiler) for a few key operators. These are common building blocks in many-core applications, though their use and mix vary. The comparison provides a reasonable baseline for assessing scalability based on the mix of Ct operators in your application. The operators fall into two classes: add, max, sqrt, and exp are element-wise operators, while addReduce and addScan are collective communication operators.

The Ct implementations of almost all element-wise operators achieve 7-8X scalability when scaling from 1 to 8 cores. Even on a single core, the Ct compiler's aggressive vectorization makes a significant difference:

  • For add, Ct-generated code achieves a 2X speedup over a scalar implementation. Given that SSE's vector width is 4 and the vectorized code is interleaved with a lot of scalar code, this speedup is quite reasonable.
  • For max, the Ct implementation takes advantage of SSE's max instructions, whereas the C version uses control flow (e.g., a > b ? a : b) that is more challenging to vectorize. Thus the speedup reaches 11X (a sketch contrasting the two forms follows this list).
  • For sqrt, Ct-generated code leverages SSE's sqrt instruction, while the C code relies on a slower C runtime library implementation. As a result, the Ct implementation achieves a significant speedup of 42X.
  • SSE doesn't have direct support for exp. However, Ct still generates a highly efficient SSE code sequence based on a look-up table and interpolation method, which outperforms the C runtime library-based implementation by 15X.
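To illustrate the contrast noted for max, here is a hedged fragment (not from the article): the branchy scalar form versus a branch-free SSE loop using _mm_max_ps, assuming N is a multiple of 4:

#include <xmmintrin.h>

// Scalar: a data-dependent select per element, harder for a compiler to vectorize.
for (int i = 0; i < N; i++)
    c[i] = a[i] > b[i] ? a[i] : b[i];

// SSE: four branch-free max operations per instruction.
for (int i = 0; i < N; i += 4) {
    __m128 va = _mm_loadu_ps(&a[i]);
    __m128 vb = _mm_loadu_ps(&b[i]);
    _mm_storeu_ps(&c[i], _mm_max_ps(va, vb));
}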

Unlike element-wise operators, which are embarrassingly parallel, collective communication operators have more complicated inter-thread communication patterns. At the same time, the collective operators pose challenges for local code vectorization.

  • The addReduce operator sums all elements of a vector. The Ct implementation achieves a 93X overall speedup over the scalar implementation, of which 12X comes from vectorization.
  • The addScan operator, which computes a prefix sum, requires more complicated communication patterns. Again, the speedup achieved by our optimized code is as high as 31X, of which 5X comes from vectorization (a scalar sketch of the scan semantics follows this list).
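For reference, a scan is a running (prefix) sum. A minimal scalar sketch of the semantics is below, shown as an inclusive scan (whether Ct's addScan is inclusive or exclusive is not specified here, and the parallel implementation is considerably more involved):

// Inclusive prefix sum: out[i] = in[0] + in[1] + ... + in[i].
out[0] = in[0];
for (int i = 1; i < N; i++)
    out[i] = out[i - 1] + in[i];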

In the future, Ct's adaptive compilation strategy will play an even more important role when new, throughput-oriented ISA extensions emerge (such as mask, cast/conversion, swizzle/shuffle, and gather/scatter).

Performance of select Ct primitives vs. C compiled with optimizations

The applications studied are listed in the table below. They span high-performance computing, financial computing, image processing, and physics simulation. Black-Scholes, Binomial Tree, and Monte Carlo Simulation are three widely used option pricing models; 2D Convolution is a typical signal processing kernel; and Narrow-Phase Collision Detection is a compute-intensive component of gaming physics.

Table 1: Applications characterized under Ct

The figure below shows Ct's performance on Core 2 Quad machines, compared to plain C code and hand-tuned SSE code (using SSE intrinsics). The figure also indicates that Ct code scales well when the number of cores increases from 1 to 8.

Application scaling with Ct

For single-thread performance, Ct achieves a speedup as high as 10X for Black-Scholes. Black-Scholes relies heavily on the performance of transcendental functions, namely exp, log, rcp, and sqrt. SSE provides efficient support for rcp and sqrt but lacks support for exp and log. Typically, programmers fall back to a scalar loop and call the corresponding C runtime functions for each element. Even though the rest of the program is well vectorized, the overall speedup of SSE over C is only 1.1X. Ct's 10X speedup is mainly attributed to the JIT's use of vectorized implementations of these transcendental functions.

Monte Carlo Simulation shows a 15.5X Ct-over-C speedup. Two factors contribute to the 8X speedup: first, Ct has a very fast, vectorized random number generator, while C has to resort to the C runtime function rand; second, Monte Carlo Simulation makes heavy use of two transcendental functions, sin and cos, for which Ct also has very efficient SSE-based implementations. Consequently, the speedup achieved by the hand-tuned SSE version is only 7.7X.

Single-threaded Ct achieves only 3.7X and 4.7X speedups for Binomial Tree and Convolution, respectively, which is not surprising given that the two applications are not arithmetic intensive. Their SSE versions are slightly faster because the Visual C++ compiler uses a static compilation strategy that makes more aggressive optimizations affordable. An interesting note is that Binomial Tree suffers from many floating-point underflow exceptions. Ct allows specifying lower numerical-precision requirements, which enables the Ct compiler to generate code in "flush-to-zero" mode and speeds up performance by a further 3-4X. For cases where lower accuracy is not tolerable, we may specify F64 (namely, double) as the base type of TVEC. Although the SIMD data width is halved, the underflow exceptions are removed entirely, which makes this version 2.5X faster than the TVEC<F32> version. It is trivial for Ct programmers to get this speedup because only the TVEC declarations change; the algorithmic code is untouched.
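A sketch of the kind of one-line change implied, with F32 and F64 as the element types named above (the typedef name is ours):

// typedef F32 Elem;            // original: single precision, wider SIMD, but underflows
typedef F64 Elem;               // double precision: half the SIMD width, no underflow exceptions
TVEC<Elem> S(s, N), X(x, N), R(r, N), V(v, N), Tm(t, N);   // the rest of the code is unchanged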

Note that the graph shows performance when running the same Ct binary on different hardware configurations; looking forward, this gives Ct good forward scalability. Relatively inexperienced C/C++ programmers can get these performance benefits (nearly) for free on stock machines: porting C implementations to their Ct counterparts takes minor effort, and the lines of code of the two implementations are almost 1:1.

(The examples are for illustration purposes only; we omitted some code for checking boundary conditions.)

